Scaled Bias Add support after CUBLAS GGEMM#2885
vthumbe1503 wants to merge 6 commits into NVIDIA:main from …ed_linear_integration

Conversation
Greptile Summary

This PR adds optional per-row scale support to the grouped bias add that runs after the cuBLAS grouped GEMM.

Confidence Score: 5/5. Safe to merge; all remaining findings are P2 style/improvement suggestions with no runtime risk for current callers. The core scaled-bias logic is correctly implemented: the fmaf argument order matches documented semantics, the shared-memory cumsum is correctly initialized and synchronized, and the empty-tensor sentinel correctly disables scaling. The two P2 findings (dead pre-loop bias load; missing tensor_offsets guard) do not affect current callers, since bias GroupedTensors are always packed in the Python bindings.

Important Files Changed: transformer_engine/common/gemm/cublaslt_grouped_gemm.cu — dead pre-loop load and missing tensor_offsets guard in nvte_grouped_bias_add.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Py as Python (gemm.py)
    participant Ext as C++ Extension (gemm.cpp)
    participant NVTE as nvte_grouped_gemm
    participant BiasKernel as nvte_grouped_bias_add
    Py->>Ext: general_grouped_gemm_for_grouped_tensor(A, B, out, bias, bias_scale)
    Ext->>Ext: prepare_grouped_gemm_config(alpha, beta, ...)
    Ext->>NVTE: nvte_grouped_gemm(A, B, C=D, D, alpha, beta, ...)
    NVTE-->>Ext: D = alpha * A @ B + beta * C
    alt bias is not None
        Ext->>BiasKernel: nvte_grouped_bias_add(D, bias, scale)
        Note over BiasKernel: Build shared cumsum for row-to-tensor map
        BiasKernel->>BiasKernel: grouped_bias_add_kernel UseScale=true/false
        Note over BiasKernel: D[row,col] += bias[col] * scale[row]
        BiasKernel-->>Ext: D updated in-place
    end
    Ext-->>Py: D (updated)
```
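The dataflow above can be checked against a minimal pure-Python reference for the single-tensor case; the function name `gemm_then_scaled_bias` is hypothetical and not part of the PR:

```python
def gemm_then_scaled_bias(A, B, C, bias, scale, alpha=1.0, beta=1.0):
    """Reference semantics for the sequence above:
    D = alpha * A @ B + beta * C, then D[row][col] += bias[col] * scale[row]."""
    m, k, n = len(A), len(B), len(B[0])
    D = [[alpha * sum(A[i][t] * B[t][j] for t in range(k)) + beta * C[i][j]
          for j in range(n)] for i in range(m)]
    for i in range(m):      # per-token (per-row) scale ...
        for j in range(n):  # ... applied to a per-column bias
            D[i][j] += bias[j] * scale[i]
    return D
```

With an identity A this reduces to scaling the bias row by row, which is a convenient sanity check for the kernel's fmaf ordering.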
Reviews (3). Last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
```cpp
const size_t tensor_idx = blockIdx.y;
if (tensor_idx >= num_tensors) return;

const int64_t n = d_meta.last_dims ? d_meta.last_dims[0] : d_meta.uniform_last;
```
Hardcoded index [0] instead of [tensor_idx]

d_meta.last_dims[0] works only because the pre-launch NVTE_CHECK(outputD->all_same_last_dim() ...) enforces a uniform last dimension. The hardcoded index hides the per-tensor correctness argument: a future reader (or a refactor that relaxes the uniform check) would not immediately see why [0] is used instead of [tensor_idx]. A comment linking this to the uniformity invariant would make it self-documenting.
```diff
-const int64_t n = d_meta.last_dims ? d_meta.last_dims[0] : d_meta.uniform_last;
+const int64_t n = d_meta.last_dims ? d_meta.last_dims[0]  // uniform across tensors (checked)
+                                   : d_meta.uniform_last;
```
```cpp
int64_t scale_row_offset = 0;
if constexpr (UseScale) {
  if (d_meta.first_dims) {
    for (size_t i = 0; i < tensor_idx; i++) {
      scale_row_offset += d_meta.first_dims[i];
    }
  } else {
    scale_row_offset = static_cast<int64_t>(tensor_idx) * d_meta.uniform_first;
  }
}
```
Redundant per-thread scale_row_offset loop

Every thread in the block (all 256 of them) independently computes scale_row_offset by iterating up to tensor_idx times over d_meta.first_dims. Since tensor_idx == blockIdx.y, all threads in a block produce the same value. For large num_tensors, computing the offset once (e.g. by thread 0) and sharing it through shared memory would avoid the redundant iterations. The broadcast access pattern through L1 is benign for small num_tensors, but is worth noting for scalability.
```diff
 std::optional<SwizzledGroupedScales> maybe_swizzle_grouped_tensor(GroupedTensorWrapper &input,
                                                                   bool rowwise_usage,
                                                                   bool columnwise_usage) {
-  if (input.scaling_mode() != NVTE_MXFP8_1D_SCALING) {
+  if (input.scaling_mode() != NVTE_MXFP8_1D_SCALING &&
+      input.scaling_mode() != NVTE_NVFP4_1D_SCALING) {
     return std::nullopt;
   }
```
Unrelated FP4 swizzle change — should be documented

This guard extension (adding NVTE_NVFP4_1D_SCALING) is a separate fix that enables grouped-tensor scale swizzling for FP4 inputs; it is unrelated to the Scaled Bias Add feature described in the PR title. nvte_swizzle_grouped_scaling_factors does handle FP4 in swizzle.cu, so the change is mechanically correct, but the motivation should be documented in the PR description, or in a comment here explaining why FP4 also needs this path.
timmoon10 left a comment

My big question is whether the kernel implementation changes are providing a perf benefit.
```diff
 py::handle A, bool transa, py::handle B, bool transb, py::handle D, py::object bias,
-at::Tensor alpha, at::Tensor beta, at::Tensor workspace_setup, at::Tensor workspace_cublas,
+at::Tensor bias_scale, at::Tensor alpha, at::Tensor beta, at::Tensor workspace_setup,
 bool use_split_accumulator, int math_sm_count) {
```
We should avoid the overhead of constructing a tensor when bias_scale isn't needed. std::optional also communicates the intent more clearly.

```diff
-at::Tensor bias_scale, at::Tensor alpha, at::Tensor beta, at::Tensor workspace_setup,
+std::optional<at::Tensor> bias_scale, at::Tensor alpha, at::Tensor beta, at::Tensor workspace_setup,
```
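The optional-argument shape being suggested can be illustrated host-side. A sketch in Python with plain lists; the function `bias_add` and its signature are illustrative, not the extension's actual API:

```python
def bias_add(out_rows, bias, scale=None):
    """In-place bias add; when scale is None no sentinel object is built
    and the unscaled path is taken, mirroring std::optional semantics."""
    for r, row in enumerate(out_rows):
        s = 1.0 if scale is None else scale[r]
        for c in range(len(row)):
            row[c] += bias[c] * s
    return out_rows
```

The None default replaces the empty-tensor sentinel: callers that don't scale pay nothing, and the scaled path is explicit at the call site.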
```python
if bias_scale is None:
    bias_scale = torch.empty(0, dtype=torch.float32, device=device)
```

We can avoid this overhead by making the tex function take an optional argument.

```diff
-if bias_scale is None:
-    bias_scale = torch.empty(0, dtype=torch.float32, device=device)
```
```diff
 void nvte_grouped_bias_add(const NVTEGroupedTensor output, const NVTEGroupedTensor bias,
-                           cudaStream_t stream);
+                           const NVTETensor scale, cudaStream_t stream);
```

I think it makes more sense to create a separate API, nvte_grouped_scaled_bias_add. Grouped bias is a natural generalization of linear-layer biases, but grouped scaled bias is less intuitive (especially since the biases are per-group while the scales are per-token) and should be treated as more exotic.
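For reference, the semantics under discussion (per-group biases, per-token scales) can be written out in pure Python; `grouped_scaled_bias_add` here is a hypothetical name for the separate API being proposed:

```python
def grouped_scaled_bias_add(groups, biases, scales):
    """D[row][col] += biases[g][col] * scales[row], where g is the group
    owning the row and rows are numbered globally across all groups."""
    row = 0
    for g, mat in enumerate(groups):
        for r in range(len(mat)):
            for c in range(len(mat[r])):
                mat[r][c] += biases[g][c] * scales[row]
            row += 1
    return groups
```

The asymmetry the reviewer points out is visible in the indexing: biases is indexed by group, scales by the global row counter.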
```python
import torch
import torch.nn as nn
from torch.nn import Parameter
```

Nit: is there a reason we're reordering these imports? If the import order causes problems, then that's a bug we need to fix. Otherwise, this ordering seems strangely unmotivated and haphazard. It's also considered good Python style to put third-party imports before local imports (PEP 8).
```cpp
constexpr int kMaxTensors = 257;
__shared__ int cumsum[kMaxTensors];
```

The variable name is wrong: 257 is the size of the cumsum array (one more entry than the maximum tensor count), not the maximum number of tensors.

```diff
-constexpr int kMaxTensors = 257;
-__shared__ int cumsum[kMaxTensors];
+constexpr int kMaxTensors = 256;
+__shared__ int cumsum[kMaxTensors + 1];
```
```cpp
// Binary search for the starting row's tensor.
int tensor_idx;
{
  int lo = 0, hi = num_tensors;
  while (lo < hi) {
    int mid = (lo + hi) >> 1;
    if (cumsum[mid + 1] <= row_start)
      lo = mid + 1;
    else
      hi = mid;
  }
  tensor_idx = lo;
}
int bias_idx = tensor_idx * n;
```
Have we benchmarked whether this binary search is any better than just scanning through the tensors? Computing the cumsum is still O(n), so we're not improving the asymptotics, and we're introducing thread syncs and shared-memory accesses.
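The trade-off can be framed host-side: both lookups agree, and the binary search is O(log n) per thread versus O(n) for the scan, but the O(n) cumsum build dominates either way. A Python sketch of the two row-to-tensor lookups (helper names are illustrative):

```python
import bisect

def tensor_idx_binary(cumsum, row_start):
    # cumsum[i] = global first row of tensor i; cumsum[-1] = total rows
    return bisect.bisect_right(cumsum, row_start) - 1

def tensor_idx_scan(cumsum, row_start):
    # Linear alternative: walk forward until row_start falls inside tensor i
    i = 0
    while cumsum[i + 1] <= row_start:
        i += 1
    return i
```

Whether the log-factor saving pays for the extra shared-memory traffic and syncs is exactly the benchmarking question raised above.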
```diff
-const auto *b_vec = reinterpret_cast<const VecType *>(bias_ptr + col);
 VecStorage b_in;
-b_in.scratch_.aligned = *b_vec;
+b_in.scratch_.aligned = *reinterpret_cast<const VecType *>(bias + bias_idx + col);
```
This value is immediately wiped out in the loop. The compiler might be smart enough to elide the unnecessary memory access, but it makes the code harder to read.